
tors. The discriminators try to distinguish the "real" from the "fake," and the generator tries to keep the discriminators from succeeding. The result is a rectified process and a unique architecture with a more precise estimation of the full-precision model. Pruning is also explored within the GAN framework to improve the applicability of the 1-bit model in practical applications. To accomplish this, we integrate quantization and pruning into a unified framework.

3.6.1 Loss Function

The rectification process combines full-precision kernels and feature maps to rectify the binarization process. It includes kernel approximation and adversarial learning. This learnable kernel approximation leads to a unique architecture with a precise estimation of the convolutional filters obtained by minimizing the kernel loss. Discriminators $D(\cdot)$ with filters $Y$ are introduced to distinguish the feature maps $R$ of the full-precision model from those $T$ of RBCN. The RBCN generator with filters $W$ and matrices $C$ is trained with $Y$ using knowledge of the supervised feature maps $R$. In summary, $W$, $C$, and $Y$ are learned by solving the following optimization problem:

$$\arg\min_{W,\hat{W},C}\;\max_{Y}\;\mathcal{L} = \mathcal{L}_{\mathrm{Adv}}(W,\hat{W},C,Y) + \mathcal{L}_{S}(W,\hat{W},C) + \mathcal{L}_{\mathrm{Kernel}}(W,\hat{W},C), \qquad (3.62)$$

where $\mathcal{L}_{\mathrm{Adv}}(W,\hat{W},C,Y)$ is the adversarial loss

$$\mathcal{L}_{\mathrm{Adv}}(W,\hat{W},C,Y) = \log\big(D(R;Y)\big) + \log\big(1 - D(T;Y)\big), \qquad (3.63)$$

where $D(\cdot)$ consists of a series of basic blocks, each containing a linear layer followed by a LeakyReLU layer. We also use multiple discriminators to rectify the binarization training process.
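Since the text only describes $D(\cdot)$ at a high level, the following is a minimal PyTorch-style sketch of one such discriminator; the layer widths, the sigmoid output, and the flattening of the input feature maps are assumptions made for illustration rather than the exact configuration used in RBCN.

```python
import torch
import torch.nn as nn

class FeatureMapDiscriminator(nn.Module):
    """Sketch of a discriminator D(.; Y): a stack of basic blocks, each a
    Linear layer followed by LeakyReLU, ending in a real/fake probability.
    Layer widths and the flattened input are illustrative assumptions."""

    def __init__(self, feat_dim, hidden_dims=(512, 256)):
        super().__init__()
        layers, in_dim = [], feat_dim
        for h in hidden_dims:
            layers += [nn.Linear(in_dim, h), nn.LeakyReLU(0.2, inplace=True)]
            in_dim = h
        layers += [nn.Linear(in_dim, 1), nn.Sigmoid()]  # probability that the input is "real"
        self.net = nn.Sequential(*layers)

    def forward(self, feature_map):
        # Flatten an (N, C, H, W) feature map into vectors before the linear blocks.
        return self.net(feature_map.flatten(start_dim=1))
```

One such discriminator can be kept per rectified layer (for example, in an `nn.ModuleList`), matching the multiple-discriminator setup described above.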

In addition, $\mathcal{L}_{\mathrm{Kernel}}(W,\hat{W},C)$ denotes the kernel loss between the learned full-precision filters $W$ and the binarized filters $\hat{W}$, and is defined as

$$\mathcal{L}_{\mathrm{Kernel}}(W,\hat{W},C) = \frac{\lambda_1}{2}\,\big\|W - C\hat{W}\big\|^2, \qquad (3.64)$$

where $\lambda_1$ is a balance parameter. Finally, $\mathcal{L}_S$ is a traditional problem-dependent loss, such as the softmax loss. Within $\mathcal{L}$, the adversarial and kernel losses act as regularization terms alongside the softmax loss.
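As a concrete illustration of Eq. 3.64, the snippet below computes the kernel loss for a single layer; the tensor shapes, the elementwise application of $C$, and the value of $\lambda_1$ are assumptions made for this sketch, not a definitive implementation.

```python
import torch

def kernel_loss(W, W_hat, C, lambda1=1e-4):
    """Sketch of Eq. 3.64: (lambda1 / 2) * ||W - C * W_hat||^2 for a single layer.

    W       -- full-precision filters W^l
    W_hat   -- binarized filters (same shape as W)
    C       -- learnable matrix/scalar C^l, broadcastable to W_hat's shape (assumed form)
    lambda1 -- balance parameter lambda_1 (illustrative value)
    """
    return 0.5 * lambda1 * torch.sum((W - C * W_hat) ** 2)

# Illustrative usage with assumed shapes: 64 filters of size 3x3x3,
# and one learnable scaling factor per filter.
W = torch.randn(64, 3, 3, 3, requires_grad=True)
W_hat = torch.sign(W).detach()
C = torch.ones(64, 1, 1, 1, requires_grad=True)
loss = kernel_loss(W, W_hat, C)
loss.backward()  # gradients flow to W and C
```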

For simplicity, the update of the discriminators is omitted in the following description until Algorithm 13. We also omit $\log(\cdot)$ and rewrite the optimization in Eq. 3.62 as Eq. 3.65:

$$\min_{W,\hat{W},C}\;\mathcal{L}_S(W,\hat{W},C) + \frac{\lambda_1}{2}\sum_{l}\sum_{i}\big\|W_i^l - C^l\hat{W}_i^l\big\|^2 + \sum_{l}\sum_{i}\big\|1 - D(T_i^l;Y)\big\|^2, \qquad (3.65)$$

where $i$ denotes the $i$th channel and $l$ the $l$th layer. In Eq. 3.65, the objective is to obtain $W$, $\hat{W}$, and $C$ with $Y$ fixed, which is why the term $D(R;Y)$ from Eq. 3.63 can be ignored. The update process for $Y$ is given in Algorithm 13. The advantage of our formulation in Eq. 3.65 is that the loss function is trainable and can therefore be easily incorporated into existing learning frameworks.
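To make the objective in Eq. 3.65 concrete, here is a sketch of one generator-side loss evaluation with the discriminators held fixed; the per-layer bookkeeping, the treatment of the frozen discriminators, and the elementwise use of $C^l$ are illustrative assumptions rather than the exact RBCN implementation.

```python
import torch

def generator_loss(task_loss, W_list, W_hat_list, C_list,
                   T_list, discriminators, lambda1=1e-4):
    """Sketch of Eq. 3.65 with Y fixed:
        L_S + (lambda1/2) * sum_l,i ||W_i^l - C^l W_hat_i^l||^2
            + sum_l,i ||1 - D(T_i^l; Y)||^2

    task_loss      -- problem-dependent loss L_S (e.g., cross-entropy), already computed
    W_list         -- per-layer full-precision filters W^l
    W_hat_list     -- per-layer binarized filters
    C_list         -- per-layer learnable matrices C^l (broadcastable, assumed form)
    T_list         -- per-layer RBCN feature maps T^l fed to the discriminators
    discriminators -- one frozen discriminator per rectified layer
    """
    loss = task_loss
    for W, W_hat, C in zip(W_list, W_hat_list, C_list):
        # Kernel term; the sum over channels i is implicit in the full-tensor norm.
        loss = loss + 0.5 * lambda1 * torch.sum((W - C * W_hat) ** 2)
    for T, D in zip(T_list, discriminators):
        # Adversarial term for the generator: push D(T; Y) toward 1 ("real").
        loss = loss + torch.sum((1.0 - D(T)) ** 2)
    return loss
```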

3.6.2 Learning RBCNs

In RBCNs, the convolution is implemented using $W^l$, $C^l$, and $F_{\mathrm{in}}^l$ to calculate the output feature maps $F_{\mathrm{out}}^l$ as

$$F_{\mathrm{out}}^l = \mathrm{RBConv}(F_{\mathrm{in}}^l;\hat{W}^l,C^l) = \mathrm{Conv}(F_{\mathrm{in}}^l,\hat{W}^l C^l), \qquad (3.66)$$
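A minimal sketch of the RBConv operation in Eq. 3.66 is given below, assuming sign-based binarization of $W^l$, a straight-through estimator for its gradient, and one learnable scaling factor per output filter as the form of $C^l$; these choices are illustrative and not prescribed by the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RBConv(nn.Module):
    """Sketch of F_out^l = Conv(F_in^l, W_hat^l C^l): binarize the stored
    full-precision filters, scale them by the learnable C^l, then convolve."""

    def __init__(self, in_channels, out_channels, kernel_size, stride=1, padding=0):
        super().__init__()
        # Full-precision filters W^l kept for training; the binarized filters are derived from them.
        self.weight = nn.Parameter(torch.randn(out_channels, in_channels,
                                               kernel_size, kernel_size) * 0.05)
        # Learnable C^l, here assumed to be one scalar per output filter.
        self.C = nn.Parameter(torch.ones(out_channels, 1, 1, 1))
        self.stride, self.padding = stride, padding

    def forward(self, x):
        # Binarize with sign(); the straight-through trick below passes gradients
        # to the full-precision weights (a common assumption in 1-bit training).
        w_hat = (torch.sign(self.weight) - self.weight).detach() + self.weight
        return F.conv2d(x, w_hat * self.C, stride=self.stride, padding=self.padding)
```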